This dashboard analyzes global development trends from 2000 to 2022 using World Bank indicators for many countries and a range of socioeconomic and infrastructure variables. Our analysis spans GDP per capita, income inequality (Gini index), unemployment, and other indicators discussed throughout the dashboard. We investigate several research questions using regression and classification techniques: multiple regression (non-linear), ridge regression, LOESS, kNN classification, Naive Bayes classification, and logistic regression. Results are visualized with maps and plots that give clear insight into global patterns and, by comparison, help answer each research question.
This project leverages socioeconomic and infrastructure indicators from the World Bank, spanning the years 2000 to 2022, to investigate global development patterns across a diverse set of countries. The dataset includes economic measures such as GDP per capita, inflation, and unemployment, as well as social indicators like life expectancy, school enrollment, and health expenditure. Additionally, it incorporates infrastructure and technology access metrics, such as electricity access, internet usage, and mobile phone subscriptions.
Using this data, we explore research questions related to economic growth. Our methods include ridge regression to understand linear relationships, LOESS smoothing to uncover non-linear trends over time, and logistic regression for classification tasks, particularly modeling digital inclusion. The dashboard offers a visual and interactive means of understanding these complex, multidimensional relationships. The data used can be found here: https://drive.google.com/drive/u/1/folders/16j7E2yBUDPmfGM00o7rGVGaCsE9BYhnK. You can also download the data yourself from the World Bank website: https://databank.worldbank.org/source/world-development-indicators.
| Variable Name | Description |
|---|---|
| Country Name | Full name of the country |
| Country Code | Three-letter ISO country code |
| Year | Calendar year of the observation |
| GDP per capita (current US$) | Gross Domestic Product divided by midyear population, in current U.S. dollars |
| Gini index | Measure of income inequality (0 = perfect equality, 100 = perfect inequality) |
| Unemployment, total (% of total labor force) | Percentage of total labor force that is unemployed (national estimate) |
| Inflation, consumer prices (annual %) | Annual percentage change in consumer prices |
| Exports of goods and services (% of GDP) | Total exports as a percentage of GDP |
| Gross capital formation (% of GDP) | Investment in fixed assets plus net changes in inventories |
| Life expectancy at birth, total (years) | Average number of years a newborn is expected to live |
| School enrollment, tertiary (% gross) | Gross enrollment ratio in tertiary education |
| Current health expenditure per capita (current US$) | Per capita expenditure on healthcare, in current U.S. dollars |
| Population growth (annual %) | Annual population growth rate |
| Access to electricity (% of population) | Percentage of the population with access to electricity |
| Individuals using the Internet (% of population) | Percentage of individuals who use the Internet |
| Mobile cellular subscriptions (per 100 people) | Number of mobile subscriptions per 100 people |
| Urban population (% of total population) | Percentage of total population living in urban areas |
Can we predict a country's GDP per capita in USD from total enrollment in tertiary education, how much a country spends on healthcare per capita, and the percentage of the population that lives in an urban area?
Looking at the matrix plot, the strongest relationship with GDP per capita is clearly health expenditure—there’s a tight, upward trend in the scatterplot, and the correlation is really high at 0.944. School enrollment and urban population also show positive relationships with GDP, but they’re not as strong. The points are more spread out and the correlations are a lot lower.
The distributions for GDP and health expenditure are both heavily skewed, with a lot of values clustered on the low end and a long tail of high values. That suggests we should log-transform those two variables to make the relationships more linear and improve the model fit. Urban population is a percentage and only mildly skewed, but the scatterplot with GDP still shows some curvature. In that case, a square root transformation is a better choice than log since it’s more appropriate for percentage-based variables and makes the relationship more linear without distorting the scale too much.
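As a rough illustration of the transformations described above, the following Python sketch (not the dashboard's own R code) applies a natural log to GDP and health expenditure and a square root to the urban share. The sample rows and key names are made up for illustration.

```python
import math

# Hypothetical rows; values are illustrative, not taken from the World Bank data.
rows = [
    {"gdp": 1200.0, "health": 60.0, "urban": 35.0},
    {"gdp": 9800.0, "health": 540.0, "urban": 61.0},
    {"gdp": 46000.0, "health": 4300.0, "urban": 82.0},
]

def transform(row):
    """Return the transformed variables used in the regression."""
    return {
        "log_gdp": math.log(row["gdp"]),        # natural log compresses the long right tail
        "log_health": math.log(row["health"]),
        "sqrt_urban": math.sqrt(row["urban"]),  # milder transform for a 0-100 percentage
    }

transformed = [transform(r) for r in rows]
```

The log shrinks large values far more than small ones, which is why it suits the heavily skewed dollar-valued variables, while the square root changes the urban percentage more gently.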
After applying the transformations, the relationships between the variables look a lot more linear. The log transformation on GDP and health expenditure tightened up the scatterplots, especially between log_gdp and log_health, which now shows an extremely strong, nearly perfect linear trend (Corr: 0.982). The correlations between GDP and the other two predictors, school enrollment and urban population, also improved: school is now at 0.602, and the transformed urban variable (sqrt_urban) is at 0.722.
The density plots also look a lot better. log_gdp and log_health are now closer to normally distributed, and while sqrt_urban is still a little skewed, it’s a definite improvement over the original scale. The scatterplots overall look tighter and more consistent, which suggests the linearity assumption for multiple regression is reasonable.
Call:
lm(formula = log_gdp ~ school + log_health + sqrt_urban, data = train_mlr)
Residuals:
Min 1Q Median 3Q Max
-1.21562 -0.13358 -0.01252 0.14266 0.98148
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 3.8304094 0.0699193 54.783 <2e-16 ***
school -0.0003958 0.0005025 -0.788 0.431
log_health 0.8383025 0.0093836 89.337 <2e-16 ***
sqrt_urban -0.0082864 0.0119256 -0.695 0.487
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.2519 on 780 degrees of freedom
Multiple R-squared: 0.9637, Adjusted R-squared: 0.9636
F-statistic: 6901 on 3 and 780 DF, p-value: < 2.2e-16
The model explains about 96% of the variation in \(\log\) GDP per capita, which means the three predictors together do a very good job of accounting for differences in GDP across countries. The F-test result shows that the overall model is statistically significant.
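Looking at the coefficients, log_health has a strong, statistically significant positive effect on log GDP. Because both variables are logged, a 1% increase in health expenditure per capita is associated with roughly a 0.84% increase in GDP per capita, holding the other predictors fixed. The coefficients for school and sqrt_urban are slightly negative but very close to zero and not statistically significant (p = 0.431 and p = 0.487), suggesting they have little impact once health expenditure is in the model.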
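Because both the response and health expenditure are on the log scale, the log_health coefficient can be read as an elasticity. The short Python sketch below (illustrative, not the dashboard's R code) converts a percentage change in health expenditure into the implied percentage change in GDP per capita, using the coefficient from the summary above; the function name is hypothetical.

```python
import math

# Coefficient of log_health from the regression summary above.
beta_log_health = 0.8383025

def pct_change_in_gdp(pct_change_in_health):
    """Approximate % change in GDP per capita for a given % change in health
    expenditure, holding the other predictors fixed (log-log interpretation)."""
    ratio = 1.0 + pct_change_in_health / 100.0
    return (ratio ** beta_log_health - 1.0) * 100.0

print(round(pct_change_in_gdp(10.0), 1))  # 10% more health spending -> about 8.3
```

This is an association within the fitted model, not a causal estimate.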
The residuals are generally well spread out across the range of fitted values, which suggests that the model meets the assumption of constant variance. There’s no clear curvature or funnel shape, so the linearity and homoscedasticity assumptions appear to hold. There is definitely an outlier in the bottom left.
That said, there is some bunching of points between fitted values of 9 and 11, where the residuals seem to cluster more tightly around zero. This might reflect a large number of countries with similar predicted GDP values that the model was able to predict very accurately.
The horizontal dashed line represents the leverage cutoff, calculated as \(3(p/n)\). Many of the points are below the cutoff. However, some points exceed the line and have enough leverage to pull the regression line. Further research would be needed to determine whether these points are accurate.
The horizontal dashed line represents the \(4/(n-p-1)\) cutoff value. Many of the spikes fall below or around the cutoff line, but a few exceed it. Two points, reaching about 0.15 and 0.07, stand out as high-influence points. Further investigation would be needed to see whether these points are accurate or important for our model.
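For reference, the two cutoffs quoted above can be computed directly. The Python sketch below assumes n = 784 training rows (780 residual degrees of freedom plus 4 estimated coefficients) and p = 3 predictors; the dashboard may define p differently, so treat the numbers as indicative only.

```python
def leverage_cutoff(n, p):
    """Leverage threshold 3 * (p / n) as described in the text."""
    return 3.0 * p / n

def influence_cutoff(n, p):
    """Influence threshold 4 / (n - p - 1) as described in the text."""
    return 4.0 / (n - p - 1)

# Assumptions: n = 784 training rows and p = 3 predictors (see lead-in).
n, p = 784, 3
print(round(leverage_cutoff(n, p), 4))   # 0.0115
print(round(influence_cutoff(n, p), 4))  # 0.0051
```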
The studentized residuals follow the 45-degree line in the QQ plot closely. The points sit right on the line in the middle of the plot and deviate a little in the tails, but overall the plot supports the normality assumption.
In order to test our multiple regression model, we split our data into a 70-30 training-testing split.
| Metric | Value |
|---|---|
| Root Mean Squared Error (RMSE) | 0.2351 |
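The RMSE in the table is the square root of the average squared prediction error on the test set. A minimal Python sketch of that calculation follows, using made-up values. The final line assumes the reported RMSE is on the log scale, which the log_gdp response suggests, and translates it into a rough multiplicative error on the dollar scale.

```python
import math

def rmse(actual, predicted):
    """Root mean squared error between two equal-length sequences."""
    if len(actual) != len(predicted):
        raise ValueError("sequences must have the same length")
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(squared))

# Hypothetical log-GDP values, for illustration only.
actual_log_gdp = [7.1, 8.4, 9.9, 10.6]
predicted_log_gdp = [7.3, 8.2, 10.0, 10.5]
example_rmse = rmse(actual_log_gdp, predicted_log_gdp)

# An RMSE of 0.2351 on the log scale corresponds to a typical multiplicative
# error factor of about exp(0.2351) on the original dollar scale.
print(round(math.exp(0.2351), 3))  # 1.265
```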
Yes, we can predict a country's GDP per capita from total enrollment in tertiary education, how much a country spends on healthcare per capita, and the percentage of the population that lives in an urban area. The model has a very high \(R^2\) value (about 0.96) and does a good job at this task, although nearly all of its explanatory power comes from health expenditure. Overall the linear regression assumptions were upheld, but the problematic leverage and influence points deserve further research. Based on the model’s performance, linear regression is adequate for this task, and robust regression methods could be used in future work to deal with the high-leverage and influential points.
To what extent can a country’s GDP per capita be predicted by its inflation rate, unemployment, export activity and capital formation?
Why use Ridge: We have multiple potentially correlated predictors (e.g., inflation and unemployment), so Ridge helps control for multicollinearity while identifying which economic indicators best explain variation in GDP per capita.
In the Ridge Regression cross-validation plot, the red dots show the mean squared error at different values of log(lambda). The blue vertical line marks the optimal lambda, which minimizes the prediction error. The model performs best with a relatively low regularization penalty, indicating that the predictors contribute useful information and that multicollinearity is present but not severe. The curve is smooth and U-shaped, showing the trade-off between underfitting and overfitting.
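To make the idea of choosing lambda by cross-validation concrete, here is a deliberately simplified Python sketch with a single predictor, where the ridge slope has the closed form sum(x*y) / (sum(x^2) + lambda) on centered data. It is not the dashboard's R code; the data, fold scheme, and lambda grid are all assumptions for illustration.

```python
def ridge_slope(x, y, lam):
    """Ridge slope for one centered predictor: sum(x*y) / (sum(x^2) + lambda)."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    slope = sxy / (sxx + lam)
    return slope, my - slope * mx  # slope and unpenalized intercept

def cv_mse(x, y, lam, folds=3):
    """Mean squared prediction error from simple k-fold cross-validation."""
    errors = []
    for f in range(folds):
        test_idx = [i for i in range(len(x)) if i % folds == f]
        train_idx = [i for i in range(len(x)) if i % folds != f]
        slope, intercept = ridge_slope([x[i] for i in train_idx],
                                       [y[i] for i in train_idx], lam)
        errors += [(y[i] - (intercept + slope * x[i])) ** 2 for i in test_idx]
    return sum(errors) / len(errors)

# Hypothetical data: one predictor and a noisy linear response.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1, 12.2, 13.8, 16.1, 18.0]

lambdas = [0.0, 0.1, 1.0, 10.0, 100.0]
scores = {lam: cv_mse(x, y, lam) for lam in lambdas}
best_lambda = min(scores, key=scores.get)  # lambda with the lowest CV error
```

Larger penalties shrink the slope toward zero, so a very large lambda underfits. The cross-validated error curve is what identifies a sensible middle ground.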
After log-transforming GDP per capita, the Ridge Regression model shows a more stable and linear relationship with our data. The predictions are close to the actual values and suggest that the transformation corrected the skewness we had previously. The model captures the overall trend well.
We use LOESS to model GDP per capita over time for several countries. We fit both degree-1 and degree-2 local polynomials and compare how well each fits by comparing their MSEs. A lower MSE indicates a better fit, but we need to guard against overfitting. In all analyses, the span is set to 0.5.
We chose 6 countries for the analyses. The full data range is from 2000 to 2021, but some countries may have data missing for some years.
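The degree-1 case can be sketched in plain Python as a local linear fit with tricube weights. This is a simplified stand-in for a full LOESS implementation, not the dashboard's R code. The series below is made up, and MSE is computed as SSE / n, which is what the tables that follow appear to use.

```python
def tricube(u):
    """Tricube kernel weight for a scaled distance u in [0, 1]."""
    return (1.0 - u ** 3) ** 3 if u < 1.0 else 0.0

def loess_degree1(xs, ys, span=0.5):
    """Fit a local linear regression at each x and return the fitted values."""
    n = len(xs)
    k = max(2, int(round(span * n)))  # number of neighbors in each local window
    fitted = []
    for x0 in xs:
        dists = sorted(abs(x - x0) for x in xs)
        h = dists[k - 1] or 1.0  # bandwidth: distance to the k-th nearest point
        w = [tricube(abs(x - x0) / h) for x in xs]
        sw = sum(w)
        mx = sum(wi * x for wi, x in zip(w, xs)) / sw
        my = sum(wi * y for wi, y in zip(w, ys)) / sw
        sxx = sum(wi * (x - mx) ** 2 for wi, x in zip(w, xs))
        sxy = sum(wi * (x - mx) * (y - my) for wi, x, y in zip(w, xs, ys))
        slope = sxy / sxx if sxx > 0 else 0.0
        fitted.append(my + slope * (x0 - mx))
    return fitted

# Hypothetical yearly series (not World Bank values): a trend with small wiggles.
years = list(range(2000, 2012))
gdp = [1000.0 + 150.0 * i + (40.0 if i % 2 else -40.0) for i in range(12)]

fit = loess_degree1(years, gdp, span=0.5)
sse = sum((y - f) ** 2 for y, f in zip(gdp, fit))
mse = sse / len(gdp)  # in-sample MSE, i.e. SSE / n
```

Because these errors are measured on the same points used for fitting, a more flexible fit will usually score lower without necessarily generalizing better.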
| degree | SSE | MSE |
|---|---|---|
| 1 | 26000456 | 1181838.9 |
| 2 | 10837696 | 492622.5 |
| degree | SSE | MSE |
|---|---|---|
| 1 | 13101.512 | 595.5233 |
| 2 | 8579.084 | 389.9584 |
| degree | SSE | MSE |
|---|---|---|
| 1 | 176695.5 | 13591.958 |
| 2 | 63067.9 | 4851.377 |
| degree | SSE | MSE |
|---|---|---|
| 1 | 1.638410e+07 | 7.447320e+05 |
| 2 | 6.172062e+01 | 2.805483e+00 |
| degree | SSE | MSE |
|---|---|---|
| 1 | 747.0884 | 33.95856 |
| 2 | 10663.7922 | 484.71783 |
| degree | SSE | MSE |
|---|---|---|
| 1 | 127493.15 | 6710.166 |
| 2 | 17296.14 | 910.323 |
From the analyses above, we can see that LOESS tends to have a greater MSE when there is more fluctuation in a country’s growth, as is the case for the UK. For Uruguay and China, which showed more gradual growth, the MSE is lower. Usually, we expect a degree-2 fit to give a lower MSE than a degree-1 fit, but for some countries, like Indonesia and France, a degree-2 LOESS yielded a higher MSE than degree-1. The analyses here use a span of 0.5, but we should try different span values to find the optimal fit for each country.
However, while a lower MSE indicates a closer fit, we must consider the problem of overfitting. Guarding against it helps ensure that, when new data are introduced, the LOESS fit still captures most of the features.
Can we classify countries into high vs. low internet usage based on economic indicators such as a country's GDP per capita, total enrollment in tertiary education, how much a country spends on healthcare per capita, and the percentage of the population that lives in an urban area?
For this research question we will create a binary variable based on whether internet usage is greater than the median internet usage value. We will also prepare our data using min-max normalization. Lastly, we will divide the data into a 70-30 training-testing split.
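The two preparation steps can be sketched as follows in Python (illustrative only; the dashboard itself works in R). The sample values and function names are assumptions.

```python
import statistics

def min_max(values):
    """Rescale values to the [0, 1] range."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]  # constant column: no spread to rescale
    return [(v - lo) / (hi - lo) for v in values]

def median_split(values):
    """Label each value 'High' if it exceeds the median, otherwise 'Low'."""
    med = statistics.median(values)
    return ["High" if v > med else "Low" for v in values]

# Hypothetical values, for illustration only.
internet = [12.0, 45.0, 67.0, 88.0, 30.0, 91.0]   # % of population online
gdp = [900.0, 4200.0, 9800.0, 31000.0, 2500.0, 46000.0]

labels = median_split(internet)  # class labels for the classifier
scaled_gdp = min_max(gdp)        # predictor rescaled to [0, 1]
```

Min-max scaling matters for distance-based methods such as kNN, since otherwise a dollar-valued predictor would dominate percentage-valued ones.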
We will use the knn function from the
class library. Additionally, we will test k values 1
through 10, to see which performs best.
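For readers unfamiliar with the method, a minimal kNN classifier can be written in a few lines of Python. This is a conceptual sketch rather than the `knn` function from R's `class` library: ties are broken by encounter order here, and the training and test points are hypothetical.

```python
import math
from collections import Counter

def knn_predict(train_x, train_y, query, k):
    """Classify one query point by majority vote among its k nearest neighbors."""
    order = sorted(range(len(train_x)), key=lambda i: math.dist(train_x[i], query))
    votes = Counter(train_y[i] for i in order[:k])
    return votes.most_common(1)[0][0]

def accuracy(train_x, train_y, test_x, test_y, k):
    """Share of test points whose predicted label matches the true label."""
    hits = sum(knn_predict(train_x, train_y, q, k) == y for q, y in zip(test_x, test_y))
    return hits / len(test_y)

# Hypothetical min-max-scaled features (two predictors) and labels.
train_x = [(0.1, 0.2), (0.2, 0.1), (0.15, 0.3), (0.8, 0.9), (0.9, 0.7), (0.85, 0.8)]
train_y = ["Low", "Low", "Low", "High", "High", "High"]
test_x = [(0.2, 0.2), (0.9, 0.9)]
test_y = ["Low", "High"]

# Evaluate several k values, mirroring the k = 1..10 sweep described above.
results = {k: accuracy(train_x, train_y, test_x, test_y, k) for k in range(1, 6)}
```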
| k | Accuracy | Error |
|---|---|---|
| 1 | 0.7515 | 0.2485 |
| 2 | 0.7784 | 0.2216 |
| 3 | 0.7874 | 0.2126 |
| 4 | 0.7844 | 0.2156 |
| 5 | 0.7874 | 0.2126 |
| 6 | 0.7665 | 0.2335 |
| 7 | 0.8144 | 0.1856 |
| 8 | 0.8114 | 0.1886 |
| 9 | 0.8144 | 0.1856 |
| 10 | 0.7904 | 0.2096 |
As seen in the plot and table above, \(k=7\) and \(k=9\) achieve the same highest accuracy (0.8144), so we will use \(k=7\) as the best \(k\) value.
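Picking the best k from the accuracy table can be sketched in Python as below. The accuracies are copied from the table. Preferring the smallest k among ties is an assumed tie-break rule that happens to match the choice of k = 7.

```python
# Test-set accuracy for each k, copied from the table above.
accuracy_by_k = {
    1: 0.7515, 2: 0.7784, 3: 0.7874, 4: 0.7844, 5: 0.7874,
    6: 0.7665, 7: 0.8144, 8: 0.8114, 9: 0.8144, 10: 0.7904,
}

best_accuracy = max(accuracy_by_k.values())
# All k values that reach the best accuracy (ties are possible).
tied = sorted(k for k, acc in accuracy_by_k.items() if acc == best_accuracy)
best_k = tied[0]  # tie-break: prefer the smallest k

print(tied, best_k)  # [7, 9] 7
```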
| | High | Low |
|---|---|---|
| High | 128 | 23 |
| Low | 39 | 144 |
| Metric | Value |
|---|---|
| Accuracy | 0.8144 |
| Error Rate | 0.1856 |
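The accuracy and error rate follow directly from the confusion matrix: correct predictions sit on the diagonal. A short Python check using the counts above:

```python
# Counts copied from the confusion matrix above (rows and columns as printed).
matrix = [[128, 23],
          [39, 144]]

total = sum(sum(row) for row in matrix)
correct = matrix[0][0] + matrix[1][1]  # diagonal: High/High and Low/Low
accuracy = correct / total
error_rate = 1.0 - accuracy

print(round(accuracy, 4), round(error_rate, 4))  # 0.8144 0.1856
```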
Yes, we can classify countries into high vs. low internet usage based on economic indicators such as a country's GDP per capita, total enrollment in tertiary education, how much a country spends on healthcare per capita, and the percentage of the population that lives in an urban area. Additionally, as seen in the analysis, \(k=7\) performs the best for this task, with a test accuracy of 0.8144.
Can we classify countries into high vs. low life expectancy based on development statistics like the percentage of the population that lives in an urban area, how much a country spends on healthcare per capita, and the percentage of the population with access to electricity?
For this research question, we will create a binary variable for high and low life expectancy based on whether a country's life expectancy is greater than the median life expectancy in the data. Additionally, we will create a 70-30 training-testing split.
| | High | Low |
|---|---|---|
| High | 157 | 83 |
| Low | 10 | 84 |
| Metric | Value |
|---|---|
| Accuracy | 0.7216 |
| Error Rate | 0.2784 |
This plot shows how confident the model is in predicting high life expectancy. It does a really good job predicting high life expectancy countries, but is less confident on low life expectancy ones.
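The same asymmetry shows up in the Naive Bayes confusion matrix when accuracy is broken down by class. The Python sketch below assumes rows are predicted classes and columns are actual classes. That orientation is a guess, based on each column summing to 167, as a median split would produce.

```python
# Counts copied from the Naive Bayes confusion matrix. Assumption: rows are the
# predicted class and columns the actual class (each column sums to 167).
pred_high = {"High": 157, "Low": 83}  # predicted High, by actual class
pred_low = {"High": 10, "Low": 84}    # predicted Low, by actual class

actual_high = pred_high["High"] + pred_low["High"]  # 167
actual_low = pred_high["Low"] + pred_low["Low"]     # 167

recall_high = pred_high["High"] / actual_high  # share of actual High found
recall_low = pred_low["Low"] / actual_low      # share of actual Low found

print(round(recall_high, 3), round(recall_low, 3))  # 0.94 0.503
```

Under that assumption, about 94% of high life expectancy cases are identified, but only about half of the low ones.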
Yes, we can classify countries into high vs. low life expectancy based on development statistics such as the percentage of the population that lives in an urban area, how much a country spends on healthcare per capita, and the percentage of the population with access to electricity. As seen in our analysis above, the Naive Bayes model performs reasonably well on this task, with a test accuracy of 0.7216.
How do life expectancy, income inequality, and unemployment influence the likelihood of a country being underdeveloped?
For this research question, we will define underdeveloped countries as those with life expectancy less than 70, and use this to create our classes.
Call:
glm(formula = low_expectancy ~ ., family = "binomial", data = logit_data)
Coefficients:
| Term | Estimate | Std. Error | z value | Pr(>\|z\|) |
|---|---|---|---|---|
| (Intercept) | 3.801e+00 | 6.835e-01 | 5.562 | 2.67e-08 *** |
| `Unemployment, total (% of total labor force) (national estimate)` | -1.079e-01 | 3.090e-02 | -3.493 | 0.000478 *** |
| `Gini index` | -5.121e-02 | 1.552e-02 | -3.300 | 0.000965 *** |
| `GDP per capita (current US$)` | -6.504e-04 | 7.284e-05 | -8.929 | < 2e-16 *** |
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 770.24 on 1115 degrees of freedom
Residual deviance: 368.99 on 1112 degrees of freedom
AIC: 376.99
Number of Fisher Scoring iterations: 10
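To see what the fitted coefficients imply, the Python sketch below plugs them into the logistic function for a hypothetical country profile. It assumes `low_expectancy` = 1 marks life expectancy below 70, as the class definition above suggests; the input values are made up.

```python
import math

# Coefficients copied from the glm summary above.
INTERCEPT = 3.801
B_UNEMPLOYMENT = -0.1079
B_GINI = -0.05121
B_GDP = -0.0006504

def prob_low_expectancy(unemployment, gini, gdp_per_capita):
    """Predicted probability of the modeled class under the fitted logit model."""
    logit = (INTERCEPT + B_UNEMPLOYMENT * unemployment
             + B_GINI * gini + B_GDP * gdp_per_capita)
    return 1.0 / (1.0 + math.exp(-logit))

# Hypothetical country profiles, for illustration only.
p_poorer = prob_low_expectancy(5.0, 40.0, 2000.0)
p_richer = prob_low_expectancy(5.0, 40.0, 10000.0)
print(round(p_poorer, 3))  # 0.478
```

Raising GDP per capita from 2,000 to 10,000 USD while holding the other inputs fixed drops the predicted probability sharply, which reflects how strongly GDP drives this model.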
Conclusion: Plotting life expectancy against income inequality and the unemployment rate shows a distinct downward trend: the higher the unemployment rate, the lower the life expectancy tends to be. Once income inequality and GDP per capita are included, the curve levels off in the region where life expectancy is below 70 years.
We provided a separate graph to see how life expectancy compares to GDP per capita and our other factors (unemployment and Gini index). The graph suggests that lower GDP per capita, combined with higher unemployment and greater income inequality, corresponds to a higher probability of a country having a life expectancy below 70 years. Note, however, that in the fitted model the coefficients on unemployment and the Gini index are negative, so this pattern is driven mainly by GDP per capita: holding GDP fixed, higher unemployment and inequality are associated with slightly lower odds of low life expectancy (assuming low_expectancy = 1 marks life expectancy below 70).